MDP & Q-Learning & SARSA

In this lab, we introduce the concept of a Markov Decision Process (MDP) and two solution algorithms. We then present the Q-Learning and SARSA algorithms, and finally use Q-Learning to train an agent to play the "Flappy Bird" game.

Comparison between Q-Learning and SARSA

The Q-Learning plots below come from the TA, because I was missing the last two pieces of code needed to generate the Q-Learning images myself.

Comparison of Lifetime

| Type | Lifetime |
| --- | --- |
| Q-Learning | (figure: lifetime over training epochs) |
| SARSA | (figure: lifetime over training epochs) |

The best lifetime achieved by Q-Learning exceeds SARSA's by about 13,000 time units. The average lifetime also favors Q-Learning: many of its epochs reach a lifetime above 1000, while few SARSA epochs do.

Comparison of Reward

| Type | Reward |
| --- | --- |
| Q-Learning | (figure: reward over training epochs) |
| SARSA | (figure: reward over training epochs) |

Similarly, the best reward of Q-Learning is higher than SARSA's, and its average reward is higher as well.

Others

In fact, the lifetime and reward curves look alike, because the reward in Flappy Bird depends on how long the bird survives. Overall, Q-Learning performs better here: its update bootstraps from the best next action rather than the action actually taken, so it learns the greedy value function directly.
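To make the difference concrete, here is a minimal sketch of the two update rules. The state/action encoding, the dictionary-based Q-table, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are illustrative assumptions, not the lab's actual code; only the form of the two updates is the point.

```python
import random
from collections import defaultdict

# Q-table mapping (state, action) -> estimated value; unseen pairs default to 0.
Q = defaultdict(float)

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the BEST next action, whatever is actually taken."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the next action the behavior policy actually chose."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```

The only difference is the bootstrap target: Q-Learning uses `max` over next actions (so it learns the greedy policy's values even while exploring), while SARSA uses the exploratory action it will actually take, which makes it more conservative.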